Deep Learning Math  ·  The Lab

The Tiny Language Model Lab

A real language model — small enough to live in this page. Give it words, train it with your own hands, and watch it learn to spell. Everything below runs live, on your device.

it's a bigram model — the simplest kind of language model, predating transformers by decades. it looks at one letter to guess the next. real LLMs are transformers; for that, see the Tiny Transformer Lab.

the whole assembly line — text becomes a model
1. Text · corpus
2. Tokenize · letters → ids
3. Vector · Ch.4 one-hot
4. × Matrix · Ch.5 scores
5. Softmax · Ch.6 chances
6. Loss · Ch.6 surprise
7. ↓ Gradient · Ch.2 learn
8. Generate · Ch.6 sample
9. Feedback · mock RLHF
↻ steps 3–7 repeat thousands of times while training · each card below is one stage
01

The corpus

Everything it will ever know. Edit it, or pick a starter set, then press Load.

{{ V }} letters {{ wordsN }} words {{ bigrams }} letter-pairs
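
To make the three counters concrete, here is a rough sketch of how a corpus could be turned into ids and letter-pairs. The corpus string, the "." boundary marker, and every variable name are assumptions made for illustration; the lab's own code may count things differently (for example, distinct pairs rather than all pairs).

```python
# Hypothetical starter corpus; the lab's own word lists are not shown here.
corpus = "cat hat bat"

END = "."  # assumed boundary marker, standing in for the lab's "end" slot

words = corpus.split()

# Tokenize: every distinct letter gets an integer id, with the boundary last.
letters = sorted(set("".join(words)))
vocab = letters + [END]
stoi = {ch: i for i, ch in enumerate(vocab)}

# Collect letter-pairs, wrapping each word in boundaries so the model can
# see how words start and how they stop.
pairs = []
for w in words:
    chars = [END] + list(w) + [END]
    pairs += [(stoi[a], stoi[b]) for a, b in zip(chars, chars[1:])]

print(len(letters), "letters", len(words), "words", len(pairs), "letter-pairs")
```

Each (current, next) pair is one training example: the model sees the first id and is graded on how much chance it gave the second.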
teach it more

appends to the corpus above and keeps training — add a few, train more, watch it shift.

02

Train it

Ch.2 gradient descent

Each step nudges the weights downhill to make the real next letter less surprising. Watch the loss fall.

step {{ step }} loss {{ lossDisp }} / start {{ uniform }}
learn rate {{ lrDisp }}
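
A training loop along these lines can be sketched in plain Python. This is a minimal sketch under assumptions, not the lab's actual code: the tiny corpus, the "." boundary marker, the all-zero starting weights, the learning rate, and the full-batch averaging are all choices made for this example. Starting from all zeros makes every next letter equally likely, so the starting loss equals ln(V), the loss of a uniform guess.

```python
import math

words = ["cat", "hat", "bat"]                  # hypothetical corpus
vocab = sorted(set("".join(words))) + ["."]    # "." is an assumed boundary token
stoi = {ch: i for i, ch in enumerate(vocab)}
V = len(vocab)

# Every (current letter, next letter) pair, with boundaries around each word.
pairs = []
for w in words:
    chars = ["."] + list(w) + ["."]
    pairs += [(stoi[a], stoi[b]) for a, b in zip(chars, chars[1:])]

def softmax(row):
    m = max(row)                               # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def mean_loss(W):
    # Average surprise: -ln(probability given to the real next letter).
    return sum(-math.log(softmax(W[a])[b]) for a, b in pairs) / len(pairs)

W = [[0.0] * V for _ in range(V)]              # row = current letter, column = next
lr = 1.0                                       # assumed learning rate
start_loss = mean_loss(W)

for step in range(200):
    # Accumulate the average gradient over all pairs.
    grad = [[0.0] * V for _ in range(V)]
    for a, b in pairs:
        p = softmax(W[a])
        for j in range(V):
            grad[a][j] += (p[j] - (1.0 if j == b else 0.0)) / len(pairs)
    # Step downhill.
    for i in range(V):
        for j in range(V):
            W[i][j] -= lr * grad[i][j]

final_loss = mean_loss(W)
print(f"start {start_loss:.3f} -> final {final_loss:.3f}")
```

The loss cannot reach zero on this corpus, because words start with three different letters and the model can only spread its bets across them.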

03

Generate

Ch.6 sample

Start at the boundary, sample a letter, repeat until it stops. Low temperature plays it safe; high gets weird.

temp {{ tempDisp }}
{{ s }}
prompt it
{{ promptEcho }}{{ continuationTail }}

it only remembers the last letter you gave it, so it riffs more than it answers — that's what makes it a bigram, not a chatbot.
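
The sampling loop and the temperature knob can be sketched as follows. The three-symbol vocabulary, the hand-written weights, and the `generate` helper are all hypothetical; dividing the scores by the temperature before softmax is the usual way to implement such a knob, and it is assumed here rather than confirmed by the lab.

```python
import math
import random

vocab = ["a", "b", "."]            # "." is an assumed boundary token
stoi = {ch: i for i, ch in enumerate(vocab)}
END = stoi["."]

# Hypothetical trained weights: row = current letter, column = next letter.
W = [
    [0.0, 2.0, 1.0],               # scores after "a"
    [2.0, 0.0, 1.0],               # scores after "b"
    [1.5, 1.5, -5.0],              # scores after the boundary (start of a word)
]

def softmax(row, temperature=1.0):
    # Low temperature stretches the scores apart, high temperature flattens them.
    scaled = [x / temperature for x in row]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt="", temperature=1.0, max_len=20, rng=random):
    # A bigram model only sees the last letter; no prompt means "start at the boundary".
    current = stoi[prompt[-1]] if prompt else END
    out = []
    for _ in range(max_len):
        probs = softmax(W[current], temperature)
        current = rng.choices(range(len(vocab)), weights=probs)[0]
        if current == END:         # sampling the boundary ends the word
            break
        out.append(vocab[current])
    return prompt + "".join(out)

rng = random.Random(0)
print([generate(temperature=0.5, rng=rng) for _ in range(3)])
print(generate(prompt="ab", temperature=1.5, rng=rng))
```

Note that `generate(prompt="ab")` and `generate(prompt="b")` draw from exactly the same distribution for the first new letter, which is the "only remembers the last letter" behaviour in code form.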

04

Peek inside

Ch.5 the weight matrix

The model is this grid. Row = the current letter, column = the next. Brighter means “more likely.” Watch patterns sharpen as it trains.

click any row to inspect that letter in “Follow the math” below.

05

Follow the math

Every training step runs this exact little pipeline for each letter. Pick one and watch text turn into a vector, the vector pick a row of the matrix, and a guess fall out — the strips update live as the loss drops.

trace the letter…
Ch.4 vector 1 · the letter “{{ flowInput }}” becomes a one-hot vector

a single 1 in its own slot, 0 everywhere else — one cell per letter (a…z, then “end”).

Ch.5 matrix 2 · that vector picks row “{{ flowInput }}” of the weight matrix → raw scores

one score per possible next letter. teal = the model leans toward it, red = leans away. (this row is the highlighted stripe in the heatmap above.)
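
Steps 1 and 2 amount to a vector-times-matrix product that, because the vector is one-hot, simply copies out one row. A small sketch with made-up numbers (the vocabulary, the weights, and the helper names are assumptions for illustration):

```python
vocab = ["a", "b", "c", "."]       # "." is an assumed stand-in for the "end" slot
stoi = {ch: i for i, ch in enumerate(vocab)}

# Hypothetical weight matrix: row = current letter, column = next letter.
W = [
    [0.1, 1.2, -0.3, 0.5],
    [0.9, -1.0, 0.4, 0.0],
    [-0.2, 0.3, 0.8, 1.1],
    [0.6, 0.6, 0.6, -2.0],
]

def one_hot(index, size):
    # A single 1 in the letter's own slot, 0 everywhere else.
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(x, M):
    # x (length V) times M (V x V) gives one score per possible next letter.
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]

x = one_hot(stoi["b"], len(vocab))
scores = vec_mat(x, W)

print(x)
print(scores)
print(scores == W[stoi["b"]])      # the product is just row "b" of the matrix
```

In practice the multiplication is usually skipped and the row is looked up by index; the one-hot view is what connects it to the matrix math.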

Ch.6 softmax 3 · softmax squashes scores into probabilities

now they add to 100%. the model's top guess after “{{ flowInput }}” is “{{ flowTop }}”.

softmax, by hand — each chance = e^score ÷ (sum of all e^scores = {{ sumExp }})
{{ s.ch }} score{{ s.logit }} → e^score{{ s.exp }} {{ s.pct }}
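
The same by-hand calculation, as a short sketch with made-up scores (the letters and numbers are illustrative, not taken from the lab):

```python
import math

letters = ["a", "b", "c", "."]     # "." is an assumed stand-in for the "end" slot
scores = [2.0, 1.0, 0.0, -1.0]     # hypothetical raw scores for one row

exps = [math.exp(s) for s in scores]       # e^score is always positive
sum_exp = sum(exps)                        # the shared denominator
probs = [e / sum_exp for e in exps]        # each chance = e^score / sum

for ch, s, e, p in zip(letters, scores, exps, probs):
    print(f"{ch}  score {s:+.2f} -> e^score {e:.3f}  {p:.1%}")
print("total", round(sum(probs), 6))
```

Because of the exponential, a score that is higher by 1 gets about 2.718 times the chance, however large or small the scores are overall.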
Ch.6 + Ch.2 learn 4 · compare to reality, then nudge the weights downhill
in the data, “{{ flowInput }}” is usually followed by “{{ flowTarget }}” · model gives it {{ flowPTarget }} · surprise (loss) {{ flowLoss }}

gradient descent adjusts this row so the real next letter's bar grows and the surprise shrinks. press Train and watch the teal stripe in step 3 swell under “{{ flowTarget }}.”
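
Step 4 for a single row can be sketched like this. The scores, the target slot, and the learning rate are made-up values; the gradient formula (probability minus 1 for the real next letter, probability for every other letter) is the standard result for softmax combined with a negative-log loss.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

row = [0.5, -0.2, 0.1, 0.0]        # hypothetical scores for the current letter
target = 2                         # hypothetical slot of the real next letter
lr = 0.5                           # assumed learning rate

p_before = softmax(row)
loss_before = -math.log(p_before[target])   # surprise: large when the chance is small

# Gradient of the loss with respect to each score in this row.
grad = [p - (1.0 if j == target else 0.0) for j, p in enumerate(p_before)]

# One downhill nudge: the target's score rises, every other score falls.
row_after = [w - lr * g for w, g in zip(row, grad)]

p_after = softmax(row_after)
loss_after = -math.log(p_after[target])

print(f"chance {p_before[target]:.3f} -> {p_after[target]:.3f}")
print(f"loss   {loss_before:.3f} -> {loss_after:.3f}")
```

The gradient entries always sum to zero, so one update moves probability toward the real next letter and away from the rest without changing the total.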

06

Human feedback

mock RLHF

Real models get tuned by human preference: people rate outputs, and training makes the liked ones more likely. Here's a toy version: sample some words, mark each one as liked or disliked, then apply it. The model literally re-weights the letter-pairs in the words you rated.

{{ f.word }}{{ f.mark }}

after applying, watch the heatmap shift and re-sample — liked spellings get more likely, disliked ones fade. that's preference tuning in miniature (real RLHF is fancier, but the spirit is this).
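
One way such a re-weighting could work is sketched below. The update rule here (add a fixed amount to the score of every letter-pair in a liked word, subtract it for a disliked word) is an assumption chosen for simplicity, as are the vocabulary, the `strength` value, and the helper names; the lab's actual rule is not shown on this page.

```python
import math

vocab = ["a", "b", "."]            # "." is an assumed boundary token
stoi = {ch: i for i, ch in enumerate(vocab)}

# Start from all-zero weights so every word of a given length is equally likely.
W = [[0.0] * len(vocab) for _ in vocab]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def word_pairs(word):
    chars = ["."] + list(word) + ["."]
    return [(stoi[a], stoi[b]) for a, b in zip(chars, chars[1:])]

def word_prob(word):
    # Chance of sampling exactly this word: multiply the chance of each letter-pair.
    prob = 1.0
    for a, b in word_pairs(word):
        prob *= softmax(W[a])[b]
    return prob

def apply_feedback(ratings, strength=0.5):
    # ratings: list of (word, +1 for liked or -1 for disliked).
    for word, rating in ratings:
        for a, b in word_pairs(word):
            W[a][b] += strength * rating

before = {w: word_prob(w) for w in ("ab", "ba")}
apply_feedback([("ab", +1), ("ba", -1)])
after = {w: word_prob(w) for w in ("ab", "ba")}

print(before)
print(after)
```

Because the model only stores letter-pairs, feedback on one word also shifts every other word that shares those pairs.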

That's a language model. Real training, real sampling — just very, very small.

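A frontier LLM runs this same loop of predicting, measuring surprise, and nudging weights, only with a transformer in place of the single grid, billions of weights, a context of thousands of tokens or more, and a large share of the public internet as its corpus. The math under your fingertips here is the math under all of it.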